Predictors of Low Agreement Between Automated Speech Recognition and Human Scores 


Joseph F. T. Nese 
Josh Kahn 
Akihito Kamata 


April, 2017 
Poster presented at the National Council on Measurement in Education annual meeting 


The research reported here was supported by the Institute of Education Sciences, U.S. 
Department of Education, through Grant R305A140203 to the University of Oregon. The 
opinions expressed are those of the authors and do not represent views of the Institute or the U.S. 
Department of Education. 


Abstract 


Despite prevalent use and practical application, the current standard assessment of oral 
reading fluency (ORF) presents considerable limitations that reduce its validity in estimating 
growth and monitoring student progress, including: (a) high cost of implementation; (b) tenuous 
passage equivalence; and (c) bias, large standard error, and tenuous reliability. To address these 
limitations, the Computerized Oral Reading Evaluation (CORE) system contains an automated 
scoring algorithm based on a speech recognition engine and a novel latent variable psychometric 
model. The purpose of this study is to investigate potential student and passage predictors of low 
agreement between an automated speech recognition (ASR) engine and human scores of words 
read correctly in student oral reading fluency passages. We fit a cross-classified, variable 
exposure Poisson model to estimate agreement and found that the majority of variance was 
at the student and recording levels, and that student demographic variables explained only a 
small amount (13%) of the student-level variance. 


Conceptual Framework 

Assessing oral reading fluency (ORF) is critical because it functions as an indicator of 
comprehension and overall reading achievement (e.g., Deno, 1985; Hosp & Fuchs, 2005; 
Marston, 1989). Research indicates that reading fluency should be regularly assessed in the 
classroom so an instructional response can be made when a difficulty is identified (e.g., Snow, 
Burns, & Griffin, 1998). ORF curriculum-based measurement (CBM) is used to identify students 
at-risk for poor learning outcomes through screening assessments, and to monitor student 
progress to help guide and inform instructional decision-making (e.g., Fuchs, Fuchs, Hosp, & 
Jenkins, 2001; Speece, Case, & Molloy, 2003). 

Despite prevalent use and practical application, the current standard assessment of 
ORF presents considerable limitations that reduce its validity in estimating growth and 
monitoring student progress, including: (a) high cost of implementation; (b) tenuous passage 
equivalence; and (c) bias, large standard error, and tenuous reliability. 

To address these limitations, the Computerized Oral Reading Evaluation (CORE) system 
contains an automated scoring algorithm based on a speech recognition engine and a novel latent 
variable psychometric model. Recent research on this system has shown that (a) mean error rates 
(proportion of words scored as incorrect) for a passage were higher for ASR than for human 
scores (Table 1), and (b) the agreement rate (kappa; Cohen, 1960) between ASR and human scores 
was about .88, on average, for both students and passages, but the SD was quite different 
(Table 2; about .15 for students and .03 for passages; Nese, Alonzo, & Kamata, 2016; Nese, 
Kamata, & Alonzo, 2015). The purpose of this study is to build upon prior research and 
investigate potential predictors of low agreement between ASR and human word scores. 
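
As a reminder of what the kappa agreement statistic measures, the following base R sketch computes Cohen's kappa for word-level scores on a single passage. The two score vectors are hypothetical illustration data, not values from the study, and the helper name cohen_kappa is our own.

```r
# Hypothetical word-level scores for one 10-word passage
# (1 = word scored correct, 0 = word scored incorrect). Not study data.
human <- c(1, 1, 1, 0, 1, 1, 0, 1, 1, 1)
asr   <- c(1, 1, 0, 0, 1, 1, 0, 1, 1, 0)

# Cohen's kappa: observed agreement corrected for chance agreement.
cohen_kappa <- function(a, b) {
  p_obs <- mean(a == b)                      # proportion of words scored the same
  cats  <- union(a, b)                       # score categories present
  # Chance agreement: product of the two raters' marginal proportions,
  # summed over categories.
  p_exp <- sum(sapply(cats, function(k) mean(a == k) * mean(b == k)))
  (p_obs - p_exp) / (1 - p_exp)
}

kappa_hat <- cohen_kappa(human, asr)
print(round(kappa_hat, 2))
```

In this toy passage the raters agree on 8 of 10 words, but kappa is considerably lower than .80 because most of that agreement would be expected by chance when nearly all words are read correctly.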


The purpose of this study is to investigate potential student and passage predictors of low 
agreement between an automated speech recognition (ASR) engine and human scores of words 
read correctly in student oral reading fluency passages. We fit a multi-level, cross-classified IRT 
model to model a latent estimate of agreement. 

Research Questions 

This study investigates potential student and passage predictors of low agreement 
between an automated speech recognition (ASR) engine and human scores of words read 
correctly in student oral reading fluency (ORF) passages. Our research questions are: 

(1) How is the variance in latent agreement estimates partitioned at the student and passage 
levels? 
(2) What student and passage variables predict latent agreement estimates? 

Methods 

Sample. The sample includes 560 students in Grades 2, 3, and 4 across two school 
districts in Oregon. See Table 3 for sample descriptive statistics. 

Measures. The traditional ORF measures were taken from the easyCBM online 
screening and progress monitoring assessment system (Alonzo, Tindal, Ulmer, & Glasgow, 
2006). Each passage was created to be consistent in length (250 words) and the readability of 
each form was verified to fit appropriate grade-level, initially using the Flesch-Kincaid index 
(e.g., Alonzo & Tindal, 2008), with later empirical support through applications in the field. 

The CORE passages are original works of fiction, each within ±5 words of the target length 
(short ~ 25, medium ~ 50, long ~ 85 words). Passages were written with grade-appropriate 
vocabulary and word frequency so that the average of several well-respected readability scores 
was estimated to be at grade level. 


The word accuracy (correct or incorrect) of all passages was scored both by trained human 
assessors listening to audio recordings (human) and by an automated speech recognition engine 
(ASR). All students were administered the passages via computer: one traditional ORF passage, 
and 15-18 CORE passages (2-3 long; 3-5 medium; and 8-10 short). 

Analysis. To explore the factors that may contribute to poor agreement between machine 
and human scores, we fit a cross-classified, variable exposure Poisson model: 

$$\log(\lambda) = \beta x + \log(w)$$

where $\lambda$ is the expected number of word-level disagreements between ASR and human 
scores in a recording (each word coded 0 = both scored the word as correct or both as incorrect; 
1 = one scored the word as correct, the other as incorrect), and $w$ is the exposure variable 
(total number of words per recorded audio), with random effects for the recording, student, and 
passage, and fixed effects for student gender, disability status, English Learner (EL) status, 
and recording duration. We used the lme4 package (Bates, Maechler, Bolker, & 
Walker, 2015) in the R programming language (R Core Team, 2016) to conduct the analyses. 
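
To illustrate the role of the exposure term, the base R sketch below shows how adding log(w) to the linear predictor turns a per-word disagreement rate into an expected count that scales with passage length. The per-word rate of 0.05 is a hypothetical value chosen for illustration, not an estimate from this study; the passage lengths are the approximate CORE targets.

```r
# Hypothetical linear predictor: log of the per-word disagreement rate.
# The value 0.05 is an assumption for illustration only.
eta <- log(0.05)

# Approximate CORE passage lengths (the exposure, w).
total_words <- c(short = 25, medium = 50, long = 85)

# log(lambda) = eta + log(w)  implies  lambda = w * exp(eta):
# the expected count grows in proportion to the number of words.
expected_disagreements <- exp(eta + log(total_words))

# Dividing the expected count by the exposure recovers the same rate
# for every passage length.
rate_per_word <- expected_disagreements / total_words

print(expected_disagreements)
print(rate_per_word)
```

Because the offset has a fixed coefficient of 1, the remaining coefficients describe differences in the per-word rate rather than in the raw count, which makes recordings of different lengths comparable.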

The baseline model (m0) was specified as follows: 

library(lme4)  # provides glmer()

# Baseline: intercept only, with log(total_words) as the exposure offset
m0 <- glmer(disagree_sum ~ 1 + offset(log(total_words)) + 
            (1|recording) + (1|student) + (1|passage), family = poisson) 

The comparison model (m1) was specified as follows: 

# Comparison: adds student covariates and recording duration as fixed effects
m1 <- glmer(disagree_sum ~ 1 + offset(log(total_words)) + gender 
            + disability + el + recording_duration + (1|recording) + 
            (1|student) + (1|passage), family = poisson) 

Results 
Results showed that m1 explained approximately 10% of the m0 variance, with: no 
variance explained at the recording level; gender, disability status, and EL status explaining 13% 
of variance at the student level; and recording duration explaining 53% of variance at the 
passage level (see Table 4). Note that additional passage covariates (e.g., Flesch-Kincaid, 
average word length) accounted for an additional 4% of variance at the passage level. 
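
The proportional reduction in variance can be reproduced, approximately, from the variance components in Table 4 using (m0 - m1) / m0. The base R sketch below uses the rounded variances printed in that table, so its results differ from the reported percentages by a few points; the reported figures were presumably computed from unrounded estimates.

```r
# Random-effect variances as printed (rounded) in Table 4.
m0_var <- c(recordings = 0.71, students = 0.67, passages = 0.14)
m1_var <- c(recordings = 0.73, students = 0.59, passages = 0.07)

# Proportion of the baseline variance explained at each level.
# A negative value means the variance component grew in m1.
explained <- (m0_var - m1_var) / m0_var

# Proportion of the total baseline variance explained.
total_explained <- (sum(m0_var) - sum(m1_var)) / sum(m0_var)

print(round(100 * explained))
print(round(100 * total_explained))
```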

See Table 5 for the fixed effect model results. The rate of disagreement for the intercept 
(female, non-disability, non-ELL, Grade 3, average recording duration) was 0.06. All else 
constant, the rate of disagreement was 1.86 times as high for students with disabilities as for 
students without disabilities. All else constant, the rates of disagreement for Grades 3 and 4 were 
about half the rate for Grade 2. For a one standard deviation increase in recording duration (18.5 
seconds), the rate of disagreement increased by about 10%. 
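
The rate ratios quoted above follow from exponentiating the log-scale estimates in Table 5. The base R sketch below applies this to the covariate estimates as printed in that table (rounded to two decimals), and re-expresses the grade effects with Grade 2 as the reference to show where the "about half" comparison comes from. The vector and element names are our own labels.

```r
# Fixed-effect estimates on the log scale, as printed in Table 5.
estimate <- c(male = 0.03, disability = 0.62, ell = 0.22,
              grade2 = 0.54, grade4 = -0.14)

# Exponentiating a Poisson log-rate coefficient gives a rate ratio relative
# to the reference group (female, no disability, non-ELL, Grade 3).
rate_ratio <- exp(estimate)

# Grade effects relative to Grade 2 instead of Grade 3:
# subtract the Grade 2 coefficient before exponentiating.
vs_grade2 <- exp(c(grade3 = 0, grade4 = estimate[["grade4"]]) - estimate[["grade2"]])

print(round(rate_ratio, 2))
print(round(vs_grade2, 2))
```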

Conclusion 

In response to our first research question, there was only a small proportion of variance at 
the passage level; the majority of variance was found at the student and recording levels. In 
response to our second research question, student demographic variables explained a moderate 
amount of the student-level variance (13%), and we were unable to explain any of the variance 
at the recording level (note that we did have a human rating of “audio quality” for 
about 30% of the recordings, but this variable did not meaningfully reduce variance at any 
level). The results of this study have the potential to help us begin to understand how the 
ASR scores readings of English learners or students with disabilities, to inform the refinement of 
the CORE system by identifying predictors that may indicate (a priori) an unreliable ASR score, 
and to identify text properties that degrade ASR scoring so that future models can be trained on 
these features. 

Table 1 


Comparisons of Error Rate Means and (SD) across Human and ASR, and Three CORE Passage 
Lengths 

                            Short        Medium       Long
                            25 words     50 words     85 words
Grade 2 (n = 127)  Human    .09 (.10)    .05 (.07)    .05 (.07)
                   ASR      .11 (.13)    .08 (.10)    .09 (.10)
Grade 3 (n = 158)  Human    .06 (.31)    .05 (.07)    .03 (.05)
                   ASR      .07 (.09)    .07 (.09)    .06 (.07)
Grade 4 (n = 162)  Human    .07 (.11)    .04 (.08)    .04 (.05)
                   ASR      .08 (.12)    .06 (.09)    .05 (.07)


Table 2 


Word-level Agreement (Cohen’s kappa) Comparisons between Human and ASR at the Student 
and Passage Levels, Across Grades 

              Grade 2              Grade 3              Grade 4
          Student   Passage    Student   Passage    Student   Passage
          (n=127)   (n=54)     (n=158)   (n=54)     (n=162)   (n=52)
Mean      .82       .83        .90       .90        .91       .91
SD        .20       .04        .14       .03        .12       .03
min       .16       .73        .09       .84        .20       .84
max       .99       .90        .99       .96        1.00      .96


Table 3 


Sample Descriptive Statistics 

Grade 
    2 
    3 
    4 
Sex 
    Female 
Ethnicity 
    Hispanic/Latino 
Race 
    American Indian/Alaskan Native 
    Asian 
    Black 
    Multi-Race 
    Native Hawaiian/Pacific Islander 
    Non-US Native American 
    Pacific Islander 
    White 
Disability 
English Learners 


105 


15 


11 


Table 4 
Random Effect Variance of Models 

                      m0                      m1              “Explained Variance”
Groups       Variance  % of total    Variance  % of total     (m0-m1)/m0
recordings   0.71      47%           0.73      53%            -2%
students     0.67      44%           0.59      43%            13%
passages     0.14       9%           0.07       5%            53%
Total        1.53                    1.38                     10%


Table 5 
Model Fixed Effects 

                      Estimate    SE      exp(Estimate)
(Intercept)           -3.36*      0.06    0.06
Male                   0.03*      0.07    1.03
Disability             0.62*      0.10    1.86
ELL                    0.22*      0.11    1.24
Grade 2                0.54*      0.06    1.72
Grade 4               -0.14*      0.07    0.87
Recording duration     0.01*      0.00    1.01
* p < .001. 



